Semantic Wrappers for Semi-Structured Data Extraction
نویسندگان
چکیده
In this paper, we propose an approach to extract information from HTML pages and to add semantic (XML) tags to them. Wrapping is an essential technique used to automatically extract information from Web sources. This paper describes both, a general approach based on rules, which can be used to automatically generate wrappers, and an assistant generator wrapper called WebMantic. We also provide some experimental results to show that both the rule generation process and the preprocessing task are fast and reliable.
منابع مشابه
Xtractor: A light wrapper for XML paragraph-centric documents
The emergence of XML leads the development of applications centric XML-documents. Often the documents contain tagged paragraphs of natural language texts. The extraction of relevant data from paragraphs confronts with their irregular structure hidden in the text and requires powerful extraction patterns. Although a large spectrum of wrappers has been conceived to mainly process HTML pages, the ...
متن کاملA Fuzzy Approach for Pertinent Information Extraction from Web Resources
Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures (“wrappers”) for highly structured text such as Web pages. For suitable regular domains, existing wrapper induction algorithms can efficientl...
متن کاملAnti-Unification Based Learning of T-Wrappers for Information Extraction
We present a method for learning wrappers for multi-slot extraction from semi-structured documents. The presented method learns how to construct automatically wrappers from positive examples, consisting of text tuples occurring in the document. These wrappers (T-wrappers) are based on a feature structure unification based pattern language for information extraction. The presented technique is a...
متن کاملLearning T-Wrappers for Information Extraction
We present a method for learning wrappers for multi-slot extraction from semi-structured documents. The presented method learns how to construct automatically wrappers from positive examples, consisting of text tuples occurring in the document. These wrappers (T-wrappers) are based on a feature structure unification based pattern language for information extraction. The presented technique is a...
متن کاملAutomatically Regenerating Wrappers for Web Sources Using Results from Previous Queries
A substantial subset of the web data follows some kind of underlying structure. Nevertheless, HTML does not contain any schema or semantic information about the data it represents. A program able to provide software applications with a structured view of those semi-structured web sources is usually called a wrapper. Wrappers are able to accept a query against the source and return a set of stru...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2008